In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). You might have to adjust the following code to use the correct file path on your computer.

comments <- readRDS("../data/LWT_Census_parsed.rds")

Next, we go through the preprocessing steps described in the slides. First, we remove newline characters from the comment strings without emojis (the TextEmojiDeleted variable).

library(tidyverse)

comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
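To see what this replacement does in isolation, here is a minimal, self-contained sketch (the comment string is invented for illustration):

```r
library(stringr)

# a toy comment containing a newline character
x <- "Great segment!\nCount me in."

# the pattern "\\\n" matches a literal newline; we replace it with a space
str_replace_all(x, pattern = "\\\n", replacement = " ")
# "Great segment! Count me in."
```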

Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.

library(quanteda)

toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE, # replaces the deprecated remove_hyphens argument
         remove_url = TRUE)

comments_dfm <- dfm(toks) %>% 
  dfm_remove(quanteda::stopwords("english")) # the remove argument of dfm() is deprecated
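If you want to check the effect of tokenization and stopword removal without the YouTube data, a toy example like the following (with two invented comments) shows what ends up in the document-feature matrix:

```r
library(tidyverse)
library(quanteda)

# two invented comments for illustration
toy <- c("The census counts people.",
         "People answer the census!")

toy_dfm <- toy %>% 
  char_tolower() %>% 
  tokens(remove_punct = TRUE) %>% 
  dfm() %>% 
  dfm_remove(stopwords("english"))

# "the" is dropped as a stopword; census and people appear twice,
# counts and answer once
colSums(toy_dfm)
```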

Task 1

Which are the 20 most frequently used words in the comments on the video “The Census” by Last Week Tonight with John Oliver? Save the overall word ranking in a new object called term_freq.
You can use the function textstat_frequency() from the quanteda.textstats package (in quanteda versions before 3.0, it was part of quanteda itself) to answer this question.

library(quanteda.textstats)

term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
##       feature frequency rank docfreq group
## 1      census      1763    1    1340   all
## 2      people       991    2     728   all
## 3        just       752    3     654   all
## 4        like       619    4     526   all
## 5         one       520    5     432   all
## 6       trump       514    6     457   all
## 7         can       494    7     432   all
## 8        know       453    8     402   all
## 9        john       438    9     406   all
## 10        get       434   10     386   all
## 11 government       396   11     317   all
## 12   question       394   12     329   all
## 13   citizens       369   13     270   all
## 14         us       368   14     299   all
## 15       many       365   15     315   all
## 16      think       293   16     269   all
## 17       even       292   17     271   all
## 18    country       288   18     240   all
## 19    illegal       281   19     218   all
## 20     oliver       271   20     252   all

Task 2

Instead of the raw frequency, we can also look at the number of comments that a particular word appears in. This metric accounts for the fact that a word may be used multiple times in the same comment. What are the 10 words that are used in the highest number of comments on the video “The Census” by Last Week Tonight with John Oliver?
You should use the variable docfreq from the term_freq object you created in the previous task.
term_freq  %>% 
  arrange(-docfreq) %>% 
  head(10)
##    feature frequency rank docfreq group
## 1   census      1763    1    1340   all
## 2   people       991    2     728   all
## 3     just       752    3     654   all
## 4     like       619    4     526   all
## 5    trump       514    6     457   all
## 6      one       520    5     432   all
## 7      can       494    7     432   all
## 8     john       438    9     406   all
## 9     know       453    8     402   all
## 10     get       434   10     386   all
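Note that quanteda also provides docfreq(), which computes these document frequencies directly from the DFM. A small sketch with invented comments illustrates the difference between raw frequency and document frequency:

```r
library(quanteda)

# invented comments: "census" is repeated within the first one
toy <- c("census census census", "census form")
toy_dfm <- dfm(tokens(toy))

featfreq(toy_dfm)  # raw counts: census = 4, form = 1
docfreq(toy_dfm)   # number of comments containing the term: census = 2, form = 1
```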

We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize emojis, create a DFM).

emoji_toks <- comments %>% 
  mutate(Emoji = na_if(Emoji, "NA")) %>%
  mutate(Emoji = str_trim(Emoji)) %>% 
  filter(!is.na(Emoji)) %>% 
  pull(Emoji) %>% 
  tokens()

EmojiDfm <- dfm(emoji_toks)

Task 3

What were the 10 most frequently used emojis in comments on the video “The Census” by Last Week Tonight with John Oliver?
The solution is essentially the same as the one for the first task in this exercise (word frequencies).
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
##                            feature frequency rank docfreq group
## 1         emoji_facewithtearsofjoy       114    1      67   all
## 2  emoji_rollingonthefloorlaughing        37    2      21   all
## 3               emoji_thinkingface        30    3      19   all
## 4                 emoji_registered        14    4       4   all
## 5      emoji_grinningfacewithsweat        13    5      11   all
## 6                       emoji_fire        12    6       3   all
## 7      emoji_grinningsquintingface        11    7       7   all
## 8               emoji_unamusedface         9    8       9   all
## 9        emoji_facewithrollingeyes         8    9       8   all
## 10                    emoji_toilet         8    9       5   all

Task 4

The ranking based on raw counts of emojis might be affected by YouTube users “spamming” emojis in their comments (i.e., using the same emojis many times in the same comment). Hence, it makes sense to also look at the number of unique comments that an emoji appears in. What are the 10 emojis that appear in the highest number of comments on the video “The Census” by Last Week Tonight with John Oliver?
The solution is essentially the same as the one for the second task in this exercise (docfreq for words).
EmojiFreq  %>% 
  arrange(-docfreq) %>% 
  head(10)
##                            feature frequency rank docfreq group
## 1         emoji_facewithtearsofjoy       114    1      67   all
## 2  emoji_rollingonthefloorlaughing        37    2      21   all
## 3               emoji_thinkingface        30    3      19   all
## 4      emoji_grinningfacewithsweat        13    5      11   all
## 5               emoji_unamusedface         9    8       9   all
## 6        emoji_facewithrollingeyes         8    9       8   all
## 7      emoji_grinningsquintingface        11    7       7   all
## 8                   emoji_thumbsup         7   11       7   all
## 9               emoji_manshrugging         6   14       6   all
## 10                    emoji_toilet         8    9       5   all
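An equivalent way to get this per-comment perspective is to binarize the DFM with dfm_weight(scheme = "boolean") before counting, so that repeated emojis within one comment count only once. A toy sketch (with invented comments in the emoji_... token format used above):

```r
library(quanteda)

# one user "spamming" the same emoji three times in a single comment
toy <- c("emoji_fire emoji_fire emoji_fire",
         "emoji_fire emoji_toilet")

toy_dfm <- dfm(tokens(toy))

colSums(toy_dfm)                                 # raw counts: emoji_fire = 4
colSums(dfm_weight(toy_dfm, scheme = "boolean")) # per-comment counts: emoji_fire = 2
```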

Bonus

If you have finished tasks 1-4 and/or want to try something else, you can create an emoji plot similar to the one you saw in the lecture slides for the video “The Census” by Last Week Tonight with John Oliver. We have created a script containing a function for the emoji mapping, which you can source with the following code (NB: you probably have to adjust the path to the script). You might also want to have a look at the comments in the emoji_mapping_function.R file.
source("../scripts/emoji_mapping_function.R")
You need to add the mapping objects to your plot. To see how you can construct the plot, you can have a look at slide #25 from the session on Basic Text Analysis of User Comments.
create_emoji_mappings(EmojiFreq, 10)

head(EmojiFreq, n = 10) %>% 
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat="identity",
           color = "black",
           fill = "#FFCC4D") + 
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo&t=33s",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0,150)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10